Data Cleaning with Pandas
In this lecture, we covered a basic data cleaning process using pandas. We focused on cleaning a dataset containing a list of US presidents from Wikipedia. The key operations included importing the dataset, cleaning up names, and converting date formats. Below are the detailed steps and methods used.
Cleaning the 'President' Column
Using Regex and Replace
First, we created a new column for the first names by copying the 'President' column and using the replace() function with a regex pattern.
# Create a new column 'First'
df["First"] = df['President']
# Use regex to remove last names
df["First"] = df["First"].replace(" .*", "", regex=True)
# Display the first few rows of the dataframe
df.head()
This method was not efficient, so we explored other approaches.
Using apply() Function
The apply() function allows us to apply a custom function to each row or column in a DataFrame. We wrote a function to split the 'President' column into first and last names.
# Drop the previously created column
del df["First"]
# Define the splitname function
def splitname(row):
row['First'] = row['President'].split(" ")[0]
row['Last'] = row['President'].split(" ")[-1]
return row
# Apply the function to the dataframe
df = df.apply(splitname, axis='columns')
df.head()
Using extract() Function with Regular Expressions
The extract() function can be used to extract data using regular expressions with capture groups. This method is efficient and leverages vectorized operations.
# Drop the previously created columns
del df['First']
del df['Last']
# Define the regex pattern
pattern = "(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"
# Extract the first and last names
names = df["President"].str.extract(pattern)
# Assign the extracted names to the dataframe
df["First"] = names["First"]
df["Last"] = names["Last"]
df.head()
Cleaning the 'Born' Column
To clean the 'Born' column, we used the extract() function to match the date format and then converted the column to datetime type.
# Extract the date in the pattern 'Month Day, Year'
df["Born"] = df["Born"].str.extract("([\w]{3} [\w]{1,2}, [\w]{4})")
# Convert the 'Born' column to datetime type
df["Born"] = pd.to_datetime(df["Born"])
df["Born"].head()
Summary
- Importing Libraries: We imported pandas and loaded the dataset.
- Cleaning Names:
- Used
replace()with regex to clean the 'President' column. - Utilized the
apply()function to split names. - Employed the
extract()function with regex for efficient name extraction.
- Used
- Cleaning Dates:
- Used
extract()to clean the 'Born' column. - Converted the 'Born' column to datetime type for better date manipulation.
- Used
Useful Resources
For more details on the pandas str module and its functions, refer to the Pandas Documentation on Working with Text.